GATOSTAR: A Fault Tolerant Load Sharing Facility for Parallel Applications

Authors

  • Bertil Folliot
  • Pierre Sens
Abstract

This paper explains how and why to unify load sharing and fault tolerance facilities. It presents and discusses GATOSTAR, an implementation of a fault tolerant load sharing facility. GATOSTAR integrates two applications developed on top of Unix: GATOS and STAR. GATOS is a load sharing manager that automatically distributes parallel applications among heterogeneous hosts according to multicriteria allocation algorithms. STAR is a software fault tolerance manager that automatically recovers the processes of faulty machines using checkpointing and message logging. The main advantage of this approach is that fault tolerance performs better, because the load sharing policies are applied when allocating or recovering processes. This unification not only improves the efficiency of both facilities but also avoids many redundant mechanisms between them. Indeed, each facility needs to manage at least three common features: global knowledge of the running processors, a crash detection mechanism, and remote process management. The backbone of this unification is a logical communication ring used for host crash detection and for transferring host-related information. Thus, all necessary information is acquired at a relatively low message cost compared with the two systems running independently.
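The abstract does not give the ring protocol itself, so the following is only a minimal sketch of the general idea, under stated assumptions: a single message travels once around a logical ring, each live host adds its load to it, and a host that does not respond is recorded as crashed and skipped. The function names (`circulate`, `pick_recovery_host`), the message layout, and the single-criterion choice of a recovery host are all hypothetical and are not taken from GATOSTAR.

```python
# Illustrative sketch only: names and message format are assumptions,
# not GATOSTAR's actual protocol.

def circulate(ring, loads, alive, start):
    """Pass one message once around the logical ring, beginning at `start`.

    Each live host appends its load to the message and forwards it to its
    successor. A successor that does not answer is recorded as crashed and
    skipped, so the message goes to the next live host instead.
    Returns (load_table, crashed_hosts, messages_sent).
    """
    n = len(ring)
    first = ring.index(start)
    load_table, crashed, messages = {}, [], 0
    for step in range(n):
        host = ring[(first + step) % n]
        if alive[host]:
            load_table[host] = loads[host]
            messages += 1          # one forward per live host
        else:
            crashed.append(host)   # noticed when the predecessor's send fails
    return load_table, crashed, messages


def pick_recovery_host(load_table):
    """Choose the least loaded live host (a single criterion, for brevity)."""
    return min(load_table, key=load_table.get)
```

In this sketch one traversal yields both the crash list and the load table, which is the sense in which a shared ring can serve crash detection and load information at once. A real allocator would weigh several criteria instead of load alone.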


Similar resources

Load Sharing Control of Parallel Inverters with Uncertainty in the Output Filter Impedances for Islanding Operation of AC Micro-Grid

Parallel connection of inverter modules is a solution to increase reliability, efficiency and redundancy of inverters in Micro-Grid system. Proper load sharing among parallel inverters is a key point. The circulating current among the inverters can greatly reduce the efficiency or even cause instability of the system. In this paper, a control strategy for improving the load sharing performance ...


Group-Strategyproof Cost Sharing for Metric Fault Tolerant Facility Location

In the context of general demand cost sharing, we present the first group-strategyproof mechanisms for the metric fault tolerant uncapacitated facility location problem. They are (3L)-budget-balanced and (3L · (1 +Hn))-efficient, where L is the maximum service level and n is the number of agents. These mechanisms generalize the seminal Moulin mechanisms for binary demand. We also apply this app...


Fault Tolerant Scheduling for Parallel Loops on Shared Memory Systems

While multicore/multiprocessor systems achieve significant speedup for many applications by exploiting loop level parallelism, they also suffer from increased reliability problems as a result of ever scaling device size. This paper addresses the reliability of loop dominated applications, aiming to execute parallel loops efficiently in the presence of various types of hardware faults. In this p...


Deadlock-Free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201

We have developed a hardware detour path selection facility for the Hitachi SR2201 parallel computer, which uses a multi-dimensional crossbar as an inter-processor network to ensure operating efficiency and high reliability when a part of the network is faulty. When this hardware facility is used, packets are transmitted to their destination along alternative paths to avoid the fault. However, ...


Fault tolerant system with imperfect coverage, reboot and server vacation

This study is concerned with the performance modeling of a fault tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line as well as warm standby units are subject to failures and are sent for repair to a repair facility with a single repairman who is prone to failure. If the failed unit is not detected, the system enters into an unsafe...




Publication date: 1994